This report explores a dataset holding 11 chemical properties of 4898 white wines, together with their quality grades (where 0 is very bad and 10 is very good). The wines were graded by experts.

My primary goal is to find out which chemical properties have a significant impact on wine quality, at least from the experts’ perspective.


Brief description of attributes:

1 - Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily). (tartaric acid - g / dm^3)

2 - Volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. (acetic acid - g / dm^3)

3 - Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines. (g / dm^3)

4 - Residual sugar: the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. (g / dm^3)

5 - Chlorides: the amount of salt in the wine. (sodium chloride - g / dm^3)

6 - Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. (mg / dm^3)

7 - Total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg / dm^3)

8 - Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content. (g / cm^3)

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

10 - Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. (potassium sulphate - g / dm^3)

11 - Alcohol: the percent alcohol content of the wine. (% by volume)

12 - Quality (score between 0 and 10).


## [1] 4898   13
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

Quality

The histogram of the variable “quality” shows that it is numerical and discrete. It is almost normally distributed, with a slight right skew, which suggests that there are fewer grades of 7 or above than grades of 5 or below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000


Correlation matrix

This plot shows that only a few variables should directly explain the variance of quality, mainly “alcohol” and “density”. However, that does not mean that the other variables are unimportant in determining the quality of a given wine. For example, some other variables, like “residual.sugar”, are strongly correlated with “alcohol”, which suggests that “residual.sugar” might be indirectly related to quality.
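To make the idea of "correlation with quality" concrete, here is a minimal sketch of how such coefficients could be computed with base R's `cor()`. The data frame `toy` and the helper `quality_correlations` are hypothetical and the values are invented for illustration; they are not the report's data or results.

```r
# Toy data frame standing in for the wine data (all values are invented).
toy <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 11.0, 12.2, 12.8),
  density        = c(1.0010, 0.9940, 0.9951, 0.9930, 0.9900, 0.9892),
  residual.sugar = c(20.7, 1.6, 6.9, 1.5, 2.0, 1.2),
  quality        = c(5, 5, 6, 6, 7, 8)
)

# Pearson correlation of every other column with the target column,
# ordered from the strongest to the weakest absolute correlation.
quality_correlations <- function(df, target = "quality") {
  predictors <- setdiff(names(df), target)
  r <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

cors <- quality_correlations(toy)
print(round(cors, 2))
```

Calling `cor(toy)` instead would return the full pairwise matrix, which is what a correlation-matrix plot is typically drawn from.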


Alcohol

This variable’s distribution is right skewed, and most white wines in this dataset contain around 10.5% alcohol, as the mean and median confirm. Given that alcohol has the strongest correlation with quality, I want to investigate this relationship further.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

As alcohol has the strongest correlation with quality, I checked this relationship in a scatterplot. Not surprisingly, at least in this dataset, wines with a higher alcohol percentage tend to have higher quality grades.


Density

Density summary, histogram and scatterplot against quality. Given the correlation coefficient from the correlation matrix, I expected to see a steeper slope in the scatterplot. I detected a few outliers, removed them, and then saw the slope I was expecting.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390


Residual Sugar

Residual sugar summary and scatterplot.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Creating new variable “Sweetness” based on residual sugar

According to Wikipedia (https://en.wikipedia.org/wiki/Sweetness_of_wine) and the European Union terms for wine, there’s a table that classifies the sweetness of wines based on their residual sugar (g/l), so I created a categorical variable called “sweetness” that can hold the values Dry, Medium Dry, Medium and Sweet.

Something I find quite odd is that this dataset contains just one sweet white wine out of 4898 wines, even though a quick Google search tells me that sweet white wines are very common (https://winefolly.com/review/beginners-white-wines-list/).
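A minimal sketch of how such a variable could be built with `cut()`. The cut points (4, 12 and 45 g/l) are the simplified EU thresholds, ignoring the allowances that depend on acidity; the helper name `classify_sweetness` and the sample values are hypothetical, and the exact breaks used in this report are an assumption.

```r
# Map residual sugar (g/l) to a sweetness category.
# Intervals are closed on the right: (-Inf, 4], (4, 12], (12, 45], (45, Inf).
classify_sweetness <- function(residual_sugar) {
  cut(residual_sugar,
      breaks = c(-Inf, 4, 12, 45, Inf),
      labels = c("Dry", "Medium Dry", "Medium", "Sweet"),
      right = TRUE)
}

# A few illustrative values, including the boundaries themselves.
sugar <- c(0.6, 4, 5.2, 12, 20.7, 45, 65.8)
sweetness <- classify_sweetness(sugar)
print(table(sweetness))
```

Because `cut()` returns a factor with the levels in the order given, tables and plots keep the categories in their natural Dry-to-Sweet order without any label prefixes.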


Volatile acidity

As described above, high levels of volatile acidity can lead to an unpleasant, vinegar taste. The scatterplot of volatile acidity supports that: the higher the level of volatile acidity, the lower the quality grade tends to be.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000


Total Sulfur Dioxide

The description of total sulfur dioxide states that free SO2 concentrations over 50 ppm make SO2 evident in the nose and taste of wine. That’s why I used subset to split the wines into those where total sulfur dioxide (tsd) is > 50 and those where it is <= 50, and generated two different scatterplots. The first one (tsd <= 50) tells me that the correlation between tsd and quality is not meaningful in that range, because the points are too sparse and the margin of error is huge. The second (tsd > 50) shows a negative correlation. This suggests that when SO2 is evident in the nose and taste, it becomes a problem for the quality grade: the higher the concentration, the worse the grade.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total Sulfur Dioxide > 50
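The split described above can be sketched as follows. The helper `split_correlation` and the data frame `toy` are hypothetical, and the values are invented so the example is self-contained; only the threshold of 50 comes from the text.

```r
# Toy stand-in for the wine data (all values are invented).
toy <- data.frame(
  total.sulfur.dioxide = c(20, 35, 50, 80, 120, 160, 200, 240),
  quality              = c(5, 6, 5, 7, 6, 6, 5, 4)
)

# Split the rows at the threshold and report the correlation with quality
# on each side, together with the number of rows in each part.
split_correlation <- function(df, threshold = 50) {
  low  <- df[df$total.sulfur.dioxide <= threshold, ]
  high <- df[df$total.sulfur.dioxide >  threshold, ]
  c(cor_low  = cor(low$total.sulfur.dioxide,  low$quality),
    cor_high = cor(high$total.sulfur.dioxide, high$quality),
    n_low    = nrow(low),
    n_high   = nrow(high))
}

res <- split_correlation(toy)
print(round(res, 2))
```

Reporting the row counts alongside the coefficients helps judge how much weight the low-concentration correlation deserves when that group is small.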


pH

In chemistry, pH is a logarithmic scale used to specify the acidity or basicity of an aqueous solution. (https://en.wikipedia.org/wiki/PH)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Creating new variable “Acidity” based on pH

According to this website (https://winefolly.com/review/understanding-acidity-in-wine/), there’s an informal classification of acidity based on pH, so I created a categorical variable called “acidity” that can hold the values Sweet, Light-bodied and Regular.

As said, this classification is informal, so it shouldn’t be a critical factor, but rather additional information.

# Bucket wines by pH; the numeric prefixes keep the labels in order when sorted.
# Rows with a missing pH stay NA.
df$acidity <- ifelse(df$pH <= 3.09, '1.SWEET',
                     ifelse(df$pH <= 3.5, '2.LIGHT-BODIED', '3.REGULAR'))

Sampling

I sampled 500 rows to reduce overplotting in some specific plots. (Used seed: 20082018)

set.seed(20082018)
df_sample_ids <- sample(df$X, 500)
s_df <- subset(df, df$X %in% df_sample_ids)
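As a small illustration of why the seed matters, the sketch below wraps the same three steps in a hypothetical helper, `draw_sample`, and runs it twice on an invented 20-row data frame. With the same seed, both runs should select the same rows.

```r
# Toy data frame with an id column X, mirroring the structure used above.
toy <- data.frame(X = 1:20, alcohol = seq(8, 12.75, by = 0.25))

# Draw n ids after fixing the seed, then keep the matching rows.
draw_sample <- function(df, n, seed) {
  set.seed(seed)
  ids <- sample(df$X, n)
  df[df$X %in% ids, ]
}

first  <- draw_sample(toy, 5, seed = 20082018)
second <- draw_sample(toy, 5, seed = 20082018)
print(identical(first, second))
```

Filtering with `%in%` keeps the sampled rows in their original order, which is convenient when the row index itself carries meaning.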

Other Multivariate analysis

Density x Alcohol by Quality

The correlation matrix indicates that the correlation between density and alcohol is strong (~ -0.8), and the scatterplot confirms that. More alcohol means lower density, which is reasonable because alcohol’s density is about 786 kg/m^3, roughly 208 kg/m^3 less than water. I also noticed the large spread in the boxplot of quality grade 6 (which is expected because most rows have grade 6), and the detection of several outliers.
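The direction of this effect can be checked with a rough back-of-the-envelope calculation. The sketch below assumes ideal volume-weighted mixing of water and ethanol, using the two density figures quoted above. It ignores sugar, other dissolved solids and the volume contraction of real water-ethanol mixtures, so the absolute numbers are not expected to match the dataset; only the downward trend is the point. The function name `mixture_density` is hypothetical.

```r
# Densities in kg/m^3, taken from the figures quoted in the text.
ethanol_density <- 786
water_density   <- 786 + 208

# Idealised density of a water-ethanol mixture at a given
# alcohol-by-volume percentage (simplifying assumption: volumes add up).
mixture_density <- function(abv_percent) {
  v <- abv_percent / 100
  v * ethanol_density + (1 - v) * water_density
}

# Minimum, mean and maximum alcohol content from the summary above.
abv <- c(8, 10.51, 14.2)
print(mixture_density(abv))
```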

Free Sulfur Dioxide x Total Sulfur Dioxide by Quality

These plots revealed a few outliers and showed a relevant positive correlation between free sulfur dioxide and total sulfur dioxide, which makes sense because free sulfur dioxide is part of the total.

Density x Residual Sugar by Sweetness

These plots show the positive correlation between residual sugar and density, as well as the categorical classification of sweetness based on residual sugar.


Final Plots and Summary


Plot 1

Comments on Plot 1

The correlation matrix suggested that the variable that could best explain the variance of the quality grades is alcohol, and this scatterplot supports that: the correlation is indeed positive and significant. Using alpha = 1/6 makes it easier to see where the points are really concentrated without overusing transparency, and using position_jitter adds a bit of noise along the x axis so the plot doesn’t look too much like a bar plot.
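A minimal sketch of these two settings, assuming the plot was drawn with ggplot2 (`position_jitter` is a ggplot2 function). The data frame `toy` is simulated for illustration, and the choice of quality on the x axis as well as the jitter width of 0.3 are assumptions, not values taken from the report.

```r
library(ggplot2)

# Simulated stand-in data: discrete quality grades and a noisy alcohol value.
set.seed(1)
toy <- data.frame(quality = sample(3:9, 200, replace = TRUE))
toy$alcohol <- 8 + 0.4 * toy$quality + rnorm(200, sd = 0.8)

# alpha = 1/6 makes dense regions appear darker; the jitter spreads points
# horizontally only (height = 0), so the alcohol values are not distorted.
p <- ggplot(toy, aes(x = quality, y = alcohol)) +
  geom_point(alpha = 1/6,
             position = position_jitter(width = 0.3, height = 0))

print(p)
```

Setting `height = 0` is a deliberate choice here: only the discrete axis gets noise, so the continuous measurement can still be read at its true value.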


Plot(s) 2

Comments on Plot(s) 2

The first plot shows the distribution of the categorical variable I created, called sweetness. It classifies the sweetness of a given wine according to the European Union terms. I was surprised to see that such a big dataset (4898 entries) only has one sweet wine in it.

The other one explores the relation between residual sugar and quality, and we can see that different sweetness categories relate differently to wine quality. Take the Medium Dry label, for example: it holds the largest variation and the highest correlation, which suggests that within the “Medium Dry” sweetness category, the more residual sugar a white wine has, the worse its quality, unlike the other two labels.


Plot 3

Comments on Plot 3

I didn’t use a regression line here. I’ll just quote what I’ve already said, because I have nothing to add to it.

“The correlation matrix indicates that the correlation between density and alcohol is strong (~ -0.8), and the scatterplot confirms that. More alcohol means lower density, which is reasonable because alcohol’s density is about 786 kg/m^3, roughly 208 kg/m^3 less than water. I also noticed the large spread in the boxplot of quality grade 6 (which is expected because most rows have grade 6), and the detection of several outliers.”


Reflections

Here are a few conclusions I reached after analysing this dataset:

In general, I don’t think this dataset can produce a good enough quality predictor based on a wine’s chemical properties. The relationship between quality and nearly all the variables (except for alcohol and density) is just too noisy and sparse; they can’t explain enough of quality’s variance. I have two different thoughts on that: